04. Recovering all your systems

Monitoring and responding are core to every vital system. When you architect a platform, think early in the design process about how you will know when something is wrong with it. Many kinds of monitoring can be applied to different facets of a system, and knowing which type to apply where can be the difference between success and failure.
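As a rough illustration of applying different checks to different facets of a system, here is a minimal Python sketch. The metric names, thresholds, and the `Check` and `evaluate` helpers are all hypothetical and chosen for illustration only. A real platform would pull these values from its own monitoring tooling.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One monitoring rule: alert when a metric exceeds its threshold."""
    name: str         # human-readable description of the problem
    metric: str       # key to look up in the metrics snapshot (hypothetical names)
    threshold: float  # alert when the observed value is above this


def evaluate(checks, metrics):
    """Return the names of checks that should raise an alert.

    A missing metric is treated as an alert too, on the assumption that
    losing visibility into a facet of the system is itself a problem.
    """
    alerts = []
    for check in checks:
        value = metrics.get(check.metric)
        if value is None or value > check.threshold:
            alerts.append(check.name)
    return alerts


# Hypothetical checks covering three different facets of one application.
checks = [
    Check("high error rate", "error_rate", 0.05),
    Check("slow responses", "p95_latency_ms", 500),
    Check("disk nearly full", "disk_used_fraction", 0.90),
]

# A made-up snapshot of current metric values.
sample = {"error_rate": 0.01, "p95_latency_ms": 820, "disk_used_fraction": 0.42}

alerts = evaluate(checks, sample)
print(alerts)
```

Treating a missing metric as an alert is a design choice in this sketch, not a rule. It reflects the idea that you should know early when you can no longer see part of the system.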

Always ask yourself how you would diagnose issues with an application. How would you understand its health? What are its choke points, and how would you identify them? What would you do when something breaks? Thinking through these questions is important, but it is very difficult to foresee every possible scenario. This is why advanced organizations employ techniques like "chaos engineering" to intentionally cause breakage in their environments in a controlled manner. If you have built a system to be resilient, it should survive the loss of a component, so why not terminate a random server and find out? It may be hard to get accustomed to this idea, but it can provide insight that would otherwise be impossible to gain.
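The "terminate a random server" idea can be sketched as a small simulation. The code below is only an in-memory model under stated assumptions: the `Server` class, the `min_healthy` threshold, and the function names are hypothetical, and nothing here touches real infrastructure. It shows the general shape of a controlled experiment: confirm the system is healthy, break one thing at random, then check whether the service survived.

```python
import random


class Server:
    """Hypothetical in-memory stand-in for a real server."""

    def __init__(self, name):
        self.name = name
        self.running = True


def healthy_count(fleet):
    """Count the servers that are still running."""
    return sum(1 for server in fleet if server.running)


def service_is_available(fleet, min_healthy):
    """Assume the service is up while at least min_healthy servers run."""
    return healthy_count(fleet) >= min_healthy


def terminate_random_server(fleet, rng):
    """Stop one running server chosen at random; return it (or None)."""
    candidates = [server for server in fleet if server.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def run_experiment(fleet, min_healthy, rng):
    """Run one controlled experiment and report whether the service survived."""
    # Only experiment on a system that is healthy to begin with.
    if not service_is_available(fleet, min_healthy):
        raise RuntimeError("steady state not met; aborting experiment")
    victim = terminate_random_server(fleet, rng)
    return victim, service_is_available(fleet, min_healthy)


# Three simulated servers; the service is assumed to need at least two.
fleet = [Server(f"web-{i}") for i in range(3)]
victim, survived = run_experiment(fleet, min_healthy=2, rng=random.Random(0))
print(f"terminated {victim.name}; service still available: {survived}")
```

The steady-state check at the start is what keeps the breakage controlled: the experiment refuses to run if the system is already degraded.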

Practicing

Name one reason to monitor your non-production environments.

SOLUTION: To practice monitoring your production environments

Disrupt production

Should you ever intentionally disrupt a production service?

SOLUTION: Yes. Chaos engineering, done in a controlled manner, is a strong practice.